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Multi-subject functional magnetic resonance imaging (fMRI) data 
has been increasingly used to study the population-wide relationship 
between human brain activity and individual biological or behav¬ 
ioral traits. A common method is to regress the scalar individual 
response on imaging predictors, known as a scalar-on-image (SI) re¬ 
gression. Analysis and computation of such massive and noisy data 
with complex spatio-temporal correlation structure is challenging. In 
this article, motivated by a psychological study on human affective 
feelings using fMRI, we propose a joint Ising and Dirichlet Process 
(Ising-DP) prior within the framework of Bayesian stochastic search 
variable selection for selecting brain voxels in high-dimensional SI re¬ 
gressions. The Ising component of the prior makes use of the spatial 
information between voxels, and the DP component groups the coef¬ 
ficients of the large number of voxels to a small set of values and thus 
greatly reduces the posterior computational burden. To address the 
phase transition phenomenon of the Ising prior, we propose a new an¬ 
alytic approach to derive bounds for the hyperparameters, illustrated 
on 2- and 3-dimensional lattices. The proposed method is compared 
with several alternative methods via simulations, and is applied to 
the fMRI data collected from the KLIFF hand-holding experiment. 

1. Introduction. Positive social contact is known to enhance human health 
and well-being, possibly because it helps to regulate humans’ emotional 
reactivity when facing negative stressors in daily life [Goan, Schaefer and 
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Davidson (2006); Coan, Beckes and Allen (2013); Coan (2010, 2011)]. Con¬ 
ventional studies on social contact primarily focus on its aggregated effect 
on an entire population. With the common belief that human behavior is 
controlled by individual mental decisions, which is affected by the imme¬ 
diate environment, it is desirable to investigate emotion regulation activity 
of the individual brain under different social interaction conditions. Toward 
this aim, the KLIFF hand-holding psychological experiment [Coan, Schaefer 
and Davidson (2006)] was conducted. In this experiment, 104 pairs—each 
pair consisting of a male and a female—of mentally and physically healthy 
young adults in various close relationships including friends and married 
couples were recruited from a larger representative longitudinal community 
sample [Allen et al. (2007)]. One participant of each pair was threatened 
with mild electric shock during a functional magnetic resonance imaging 
(fMRI) session while either holding a hand of a friend, holding a hand of 
a stranger or holding no hand at all, in three separate sessions, which rep¬ 
resent three different types of social interactions—positive and supportive 
social interaction with friends, general social interaction with strangers and 
no social interaction, respectively. At the end of each session, the subjects 
were asked to rate their feelings of arousal and valence [Russell (1980); Lang 
et al. (1993)] experienced during the experiment. Arousal and valence are 
the two dimensions in the framework of emotion fields, representing the ex¬ 
tent of excitement and pleasure experienced, respectively [see Bradley and 
Lang (1994) for more detailed explanation]. 

To investigate which areas in the brain are predictive of individual’s affec¬ 
tive feelings in the KLIFF study, we can construct a regression model using 
subjects’ emotion (arousal and valence) measurements as the response, and 
summaries of the fMRI images in the regions of interests (ROIs) as pre¬ 
dictors. This type of regression is often referred to as scalar-on-image (SI) 
regressions in the literature [Reiss et al. (2011); Huang et al. (2013); Gold¬ 
smith, Huang and Crainiceanu (2014)]. SI regressions with predictors from 
other imaging modalities, such as diffusion tensor imaging (DTI), have also 
been used in medical and scientific studies [e.g., Reiss et al. (2015)]. 

The SI regression model in the KLIFF study has several unique character¬ 
istics due to the features of fMRI data. First, the sample size is much smaller 
than the number of predictors, that is, the number of brain voxels (3D cubic 
volumes in the brain) in the ROIs, which is over 6000 in the KLIFF study. 
This is known as the “large p, small n” paradigm [West (2003)]. Second, 
there is rich spatial information between the predictors. Third, neighboring 
predictors are highly correlated and often have similar but weak effects on 
the response. Finally, as each voxel accounts for only a tiny area in the brain, 
it is very likely that the number of significant voxels is much larger than the 
sample size. The last two characteristics imply that even with all the true 
voxels being correctly selected, standard regression methods may still not 
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be applicable due to multicolinearity. It is therefore desirable to impose a 
certain degree of shrinkage or grouping of the regression coefficients so that 
predictors with similar values can be grouped together, and thus the effec¬ 
tive number of selected predictors is smaller than the sample size. Motivated 
by these considerations, in this article, we propose a Bayesian SI regression 
model that achieves simultaneous grouping and spatial selection of voxels 
that are predictive of individual responses. The key to our proposal is to 
define a joint Ising and Dirichlet Process (Ising-DP) prior for the regression 
parameters, within the framework of Bayesian stochastic search variable se¬ 
lection [SSVS; George and McCulloch (1993, 1997)]. The Ising component 
of the prior utilizes the spatial information between voxels to smooth the se¬ 
lection indicators of neighboring voxels, and the DP component groups the 
coefficients of voxels with similar effects to improve prediction power and 
also reduce the posterior computational burden. This method has scientific, 
statistical and computational advantages over several existing alternative 
priors. 

Bayesian inference has become increasingly popular in fMRI data analysis 
due to several attractive properties: first, the posterior inference offers direct 
probabilistic interpretation of the estimates; second, it eschews the multiple- 
comparison problem faced by classical inference; third, incorporating prior 
information is straightforward within the Bayesian framework. In particu¬ 
lar, Markov Random Fields priors, such as the Ising prior and the Potts 
prior, have been widely used to account for the spatial information between 
voxels [e.g., Gossl, Auer and Fahrmeir (2001); Woolrich et al. (2004); Penny, 
Trujillo-Barreto and Friston (2005); Bowman (2007); Bowman et al. (2008); 
Derado, Bowman and Kilts (2010); Ge et al. (2014)] and for meta-analysis 
[e.g., Kang et al. (2011); Yue, Lindquist and Loh (2012)]. Johnson et al. 
(2013) used a joint Dirichlet Process mixture and Potts prior to achieve 
simultaneous clustering and selection. Within the SSVS framework. Smith 
et al. (2003) and Smith and Fahrmeir (2007) used the Ising prior in the con¬ 
text of massive univariate general linear models [GLM, Friston et al. (1995)] 
for identifying brain regions activated by a stimulus. It is important to stress 
that the setting in Smith and colleagues is fundamentally different from the 
SI regression in this paper: the former only involves fMRI time series, with¬ 
out individual scalar outcome, and it deals with selecting and smoothing the 
coefficients from p one-dimensional regressions (one for each voxel), a setting 
broadly belonging to multiple testing; whereas our paper deals with variable 
selection from one p-dimensional regression, a much more challenging task. 

Within the SSVS but outside the fMRI literature, there is a stream of 
recent work on using the Ising prior to incorporate existing structure in¬ 
formation between variables under the “large p, small n” paradigm [e.g., 
Li and Zhang (2010); Stingo et al. (2011); Vannucci and Stingo (2011)]. 
Moreover, simultaneous selection and clustering in multiple regression was 
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discussed in Tadesse, Sha and Vannucci (2005), Kim, Tadesse and Vannucci 
(2006) and Dunson, Herring and Engel (2008), but none of those incorpo¬ 
rated existing structure between covariates. Another important but under¬ 
investigated issue is phase transition in the Ising model [for a review, see 
Stanley (1987)], which, in the context of variable selection, leads to a drastic 
change (from nearly none to nearly all) in the number of variables selected 
given an inhnitesimal change in the hyperparameters. And the difficulty and 
sensitivity in hyperparameter selection increases substantially as the degree 
of the underlying graph increases. Since the fMRI voxels naturally overlay 
a 3-dimensional lattice, it is crucial to select hyperparameters that avoid 
phase transition for valid inference and feasible computation. However, de¬ 
spite being intensively explored in statistical physics, phase transition and 
the consequent issue of hyperparameter selection has received relatively little 
attention in the literature of variable selection. Li and Zhang (2010) derived 
a ballpark estimate of the phase transition boundary for the Ising prior using 
mean field theory. But their derivation is solely based on the prior distri¬ 
bution and does not take into account the data or any prior knowledge of 
the predictors, and thus the resulting range of possible hyperparameters is 
often very wide. In this article we develop a new analytic approach to de¬ 
rive a tighter boundary of the hyperparameters based on the data and the 
posterior distribution, and illustrate it on 2- and 3-dimensional lattices. 

The rest of the article is organized as follows. Section 2 introduces the 
new Bayesian model and Section 3 develops an analytic approach to hy¬ 
perparameter selection. Posterior computation of the model is discussed in 
Section 4. Section 5 compares the proposed methods with several existing 
methods through simulations. In Section 6 we apply the proposed method 
to the KLIFF study to investigate the social regulation of human emotion. 
Section 7 concludes. 

2. The model. We formulate the problem via a standard multiple re¬ 
gression 

(1) Y = Xri + e, 

where Y is the n x I variable response, for example, the scalar arousal or 
valence measurement in the KLIFF study; X = (Xi,... ,Xp) is the n x p 
{p ^ n) matrix of spatially correlated neuroimaging covariates, for example, 
the magnitudes of the estimated hemodynamic response function (HRF) 
of the voxels in the two ROIs in the study; and e is the error term with 
e ~ N(0,cj^/n). To focus on the main message, we do not consider design 
variables, such as age and sex, which can be easily added to the regression. 

To select the voxels that are predictive of the response, we adopt the 
Bayesian SSVS approach that assumes the “spike-and-slab” type of mix¬ 
ture prior for the regression coefficients [Mitchell and Beauchamp (1988); 
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George and McCulloch (1993, 1997); Smith and Kohn (1996)]. Specifically, 
we define a latent indicator 7 j G { 0 , 1 } for each covariate that indicates 
whether this covariate is included in the model (i.e., whether a voxel is 
significantly predictive of the response). We let 

T]j = • 13j and f3j ^ G, 

where 13j represents the regression coefficient of predictor j once it is selected, 
and G is a prespecified probability distribution. Given 7 ^ and G, r]j are 
independent following a spike-and-slab prior 

(2) 7/j|(7j,G) ~ (1 - 7j)(5o + 7 jG, 

where 5o is a point mass at 0. Our goal is to propose a new joint Ising and 
DP (Ising-DP) prior, where an Ising prior is imposed on 7 ' = ( 71 ,... , 7 p)^ to 
incorporate spatial information between voxels, and, in parallel, a Bayesian 
nonparametric DP prior is imposed on G to achieve grouping of the regres¬ 
sion coefficients, as elaborated below. 

We represent the spatial structure among the fMRI voxels via a graph. 
Let i ~ j denote that i and j are neighboring voxels. Let £ = {(ji, J 2 ): 1 < 
ji ~ J 2 < p} be the set of all the neighboring pairs of voxels—the edge set 
of the underlying graph. Given £, let a = (oi,..., Op)' be a vector and B = 
( 5 jij 2 )pxp be a symmetric matrix of real numbers where = 0 for all 
{ 31 ^ 2 ) ^£- To incorporate the prior structural information into the model 
building process, we assume an Ising prior distribution for 7 [Li and Zhang 
( 2010)1 as the first component of the proposed prior: 

(3) Pr( 7 ) =exp{a '7 + yB 7 -V'(a,B)}, 

where '0(a, B) is the normalizing constant: ^/:(a, B) = log{^...j,g|Q exp(a '7 + 
7 'B 7 )}.IfB = 0, then i/'(a, B) = log(l + e“-’'), but in general there is no 
closed form for The Ising model is a binary Markov Random Fields model 
and encourages the formation of clusters of like-valued binary variables. 

The hyperparameters a control the sparsity of 7 . Since we are focused on 
2D and 3D lattices, which are regular graphs (i.e., each vertex has the same 
degree), we do not want to favor a priori the inclusion of any voxel. This is 
achieved by letting a = alp, where Ip = (1,1,..., 1)' G The hyperparam¬ 
eters represent the prior belief on the strength of coupling between 

the pairs of neighbors { 31 ^ 2 )^ and thus control the smoothness of 7 over £ 
given a, with larger leading to tighter coupling. When B = 0, the prior 
is the standard i.i.d. Bernoulli for each predictor [George and McCulloch 
(1993)]. Without specihc prior information of the strength of connection be¬ 
tween each pair of neighbors, it is natural to assume to be a constant 

b. Then (a, B) reduce to two hyperparameters (a, 6 ), which can be either 
pre-fixed or assumed to follow some hyperprior distributions. 
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The Ising prior smoothes the binary selection indicators, but not the 
regression coefficients. In structured high-dimensional settings like fMRI, 
neighboring covariates, often highly correlated, tend to have similar effects 
on the outcome. Intuitively, a certain degree of smoothing or grouping of the 
coefficients would improve the model fitting, especially when the effects of 
individual predictors are very weak. We achieve this by imposing a DP prior 
on G, G~ DP(q;,Go), with a precision parameter a and base measure Go 
[Ferguson (1973, 1974); Antoniak (1974)]. Following the sticking-breaking 
(SB) presentation [Sethuraman (1994)], G can be written as a weighted sum 
of an infinite number of point masses (atoms); 

OO 

h=l 

= w(,''~'Beta(l,a), 

k<h 

point mass at 6. It is clear from (4) that samples from a 
DP are discrete and the component weights Wh decrease exponentially in 
expectation. The spike-and-slab prior (2) for each r] can then be written as 
a mixture of an infinite number of point masses (at 0 and atoms randomly 
drawn from the base measure Go): 

OO 

(5) r]j \ ( 7 j, w, 0) ~ (1 - 7j)(5o + 7i WhK (0, 

h=l 

where 6 = (0i ,... ,6h, ■ ■ ■) and w = (u^i,..., Wh, ■ ■ ■)■ The clustering nature of 
the DP prior can be immediately seen from (5): it classifies the voxels into 
one cluster of voxels that have no effect on response, and several clusters of 
the remaining voxels, where the regression coefficients within each cluster 
are shrunk to be identical. The number of clusters increases automatically 
as the number of voxels under consideration, p, increases. The precision 
parameter a governs the number of active components and is assumed to 
follow a flexible hyper Gamma(l, 1) prior. And we assume the base measure 
Go = N(0,u^) with hyperparameter v. In this article, clustering per se is 
not the primary interest, rather clustering is a means of grouping similar 
coefficients. There is a clear scientific justihcation for grouping regression 
coefficients in this manner, as each predictive brain region usually contains 
a number of voxels that are of similar (and usually weak) effects on the 
outcome. Clustering also introduces substantial improvement in posterior 
computation because instead of sampling the coefficient for each voxel, one 
only need to sample the common coefficient for each cluster. 

Jointly, equations (3), (4) and (5) define the new Ising-DP spike-and-slab 
prior. 


(4) 

where 6 $ is a 
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3. Selection of hyperparameters. Selection of the hyperparameters a, b 
in the Ising prior is crucial for both inference and computational feasibility 
for high-dimensional data. A challenging feature of the Ising prior in the 
“large p” paradigm is the phase transition behavior in a graph with dimen¬ 
sion higher than 1 : certain combinations of the hyperparameters a,b lead 
to the selection of almost all variables and thus induce critical slowdown 
of the MCMC for posterior computation. This issue cannot be mitigated 
by simply replacing a and 6 by a hyper prior, because for a regular graph 
with even modest degree (say, 3), the range of hyperparameters that do not 
incur phase transition is narrow. If the domain of the prior is not carefully 
chosen, it is very likely that little weight is assigned to appropriate hyper¬ 
parameters, leading to poor posterior results, especially for data with low 
signal-to-noise ratio (SNR), such as fMRI data. Smith and Fahrmeir (2007) 
suggested to co-estimate the hyperparameters and the binary indicators in 
posterior computation. Their method relies on specifying a uniform prior 
between zero and a prespecihed maximum for the smoothing parameter b. 
However, if the maximum is specified outside the phase transition bounds, 
the resulting MCMC will still suffer from the critical slowdown. Therefore, 
finding these phase transition bounds is central to correct specification of 
hyperparameters for the Ising prior. 

Solely based on the prior distribution, Li and Zhang (2010), page 1205, 
used mean field approximations to derive a ballpark estimate of the phase 
transition boundary for the Ising prior defined on regular graphs, and illus¬ 
trated it on a hypertube with degree of 6 . However, because this approach 
does not take into account the data or any prior knowledge of selection rate, 
it often results in a very wide range of hyperparameters. The problem be¬ 
comes even more pronounced when the degree of the graph increases. Below 
we develop a new method to tighten the bounds on o and b based on the 
posterior distribution. 

The posterior conditional density of 7 given the rest of parameters is 
proportional to 



In high-dimensional settings, usually it is reasonable to a priori assume spar¬ 
sity, that is, the proportion of true predictors among the p candidates, vr, is 
much smaller than 1. Intuitively, in order to have only a small proportion of 
predictors being selected, the mode of C{'y) should be larger than C( 0 p) and 
attained at a 7 such that the number of nonzero 7 ’s is around vr • p, beyond 
which C( 7 ) should decrease fast as the number of nonzero 7 ’s increases. 
Below we form inequalities for a and b based on this intuition. 

When all the candidate voxels locate on a lattice, selected voxels give rise 
to the largest number of neighboring pairs when they form a square in two 
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dimensions or a cube in three dimensions. Therefore, we use squares (2D) 
or cubes (3D) to approximate the location of the selected vr • p voxels on the 
lattice. Let V = [(vr ■ where [c] denotes the largest integer no larger 

than c, and d is the dimension of the lattice, which equals either 2 or 3. For 
a square containing voxels, there are 4F^ — 6 V + 2 neighboring pairs; for 
a cubic containing voxels, there are 13F^ + 28 + 66 (F — 2) + 51(F — 2)^ 
neighboring pairs (derivations are given in Appendix A). 

3.1. Selection on two-dimensional lattice. We first discuss the two-di¬ 
mensional lattice. For selected voxels on a square, 

a '7 -I- 7^67 = {a-\- 86 ) F^ — 126F -|- 46. 

To achieve sparsity, this value needs to decrease fast as F increases, thus we 
must have a -|- 86 < 0. We also need the conditional density of selecting F^ 
voxels to be larger than the null model with zero voxel, that is, 

n 

i=\ 

('‘I 

< (a + Sb)V^ - 12bV + 4b- J^(Y; - X.:(/3 ■ 7))V2a^ 

i=l 

Since total variation of the observed F, ~ 

Xj(/3 • 7 ))^ is the sum of squared errors, and — 7 ))^ ~ na'^, 

then Er=i(^i - - Er=i(^* “ ' l)f ~ ^ 

is the determinant of coefficient in the linear regression of Y versus X. 
Then inequality ( 6 ) is reduced to 

(a + 86 )f2 - 126F + 46 > 

We now propose two ways to determine to further tighten the inequality. 

In the first method, we prespecify the value that we expect to achieve. 

Then given F from prior knowledge, obtain bounds on the parameters o and 
6 . For example, if we want at least 50% of variation of Y to be explained 
by the regression, and at most 5% of 1000 voxels to be selected, we may let 
= 50% and F = [\/OT] = 7, then the inequality becomes 49(a-|- 86 ) -846-1- 
46 > —nj2, that is, 3126-1-49a > —nj^. Consequently, the range of a and 6 is 
determined by two inequalities: —86 > a > [—njl — 3126)/49 and 6 < n/160. 
The second method is to approximate by a lower bound obtained based 
on the data: the maximum among all simple linear regressions of Y 
versus each single predictor X. We believe such a lower bound is an effective 
approximation for the problem under study for two reasons. First, by using 
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the DP prior, usually most of the selected voxels should have identical (3, 
effectively converting the multiple regression to a simple linear regression. 
Second, for fMRI data, spatially close voxels typically have very similar X 
values, and thus the value from regressing Y on multiple spatially close 
predictors is expected to be very similar to that from regressing Y versus a 
single predictor. 


3.2. Selection on three-dimensional lattice. Analogously, we can derive 
the range of a and b for a three-dimensional lattice. For voxels forming 
a cubic and V > 1, 


(7) 


a '7 + 7 'B 7 = (a + 266) (P - 2 )^ + 6 (a + 176) (P - 
+ 12(a -L 116)(P - 2) + 8 a -L 566. 


In order to avoid all predictors being selected, we need C{'y) < 0 to decrease 
fast as P increases after certain threshold. For simplicity, we only require 
C( 7 ) to be negative for the maximum possible P, that is, P = For 

example, in the KLIFF data, p is around 6600 in both ROIs, then P = 18 
and, consequently, a < —236. In addition, in order to avoid the null model, 
that is, no voxel being selected, we have 

( 8 ) 

Given the prespecihed and P, we can obtain the range of a and 6 sat¬ 
isfying this inequality. Again taking the KLIFF data, for example, n = 104, 
we want at most 1% voxels selected, and the expected R^ is 0.5. Then 
P = [66.7^/^] = 4, plug this value and R^ = 0.5 into the inequality ( 8 ), and 
we have a > —14.66 — 0.81. Combining the previously obtained inequality 
a < —236, it must be the case that —236 > —14.66 — 0.81, so that we have 
6 < 0.1. Therefore, for the KLIFF data analysis, we will choose a and 6 such 
that 6 < 0.1 and —236 > a> —14.66 — 0.81. 

One potential problem of using (7) to evaluate a '7 -|- 7 'B 7 in ( 8 ) is the 
overestimation of the number of neighboring pairs of selected voxels, espe¬ 
cially when the selected P is larger than 3, which can lead to a very tight 
range of 6 and a. We instead propose that as long as there is one pre¬ 
dictor whose posterior probability of being selected is larger than that of 
not selected, the posterior simulation will not be stuck at the null model. 
Therefore, we can just let o > implying 6 < 2 - 23 {^r‘^) that 

—236 > 27 ") ) • For one of the real data sets under study, the maximum R^ 

across all simple linear regressions is 0.10, then we have —236 > o > —5.8 
and 6 < 0.25. Given the derived range of hyperparameters, and with the be¬ 
lief that all the true predictors are tightly clustered together, we first choose 
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the largest possible b to induce the most spatial clustering effect; then given 
the value b, we choose the smallest a within the phase transition boundary 
to induce sparsity. Such a choice of a also brings computational advantage, 
because the computational cost of obtaining the regression coefficients de¬ 
creases with the number of selected predictors in each MCMC iteration. 
Here, we choose b = 0.2 and a = —4.5 as the hyperparameters for the Ising 
prior. 

3.3. Remarks. The above derivation suggests the following: first, the 
larger and the sample size n, and the smaller the degree of the underlying 
graph (i.e., the average number of neighbors of each candidate predictor), 
the wider the range of b; and second, the range of a depends on both b and 
the degree of the graph. Generally, for an Ising model built on a regular 
graph, given b, a larger degree of the graph leads to smaller a. These are 
consistent with a general understanding of the effect of prior distributions 
in Bayesian inference: when R^ and n are large, indicating a strong SNR 
and abundant data information, choice of prior is less crucial. On the other 
hand, if each predictor has many neighbors, then the positive part 7 'B 7 in 
the prior will give a strong preference to models with many spatially close 
predictors. Therefore, we need to use a smaller b in order not to impose 
a strong prior. This also explains, for fixed b, the larger the degree of the 
graph, the smaller a is required to induce a small prior odds of selecting a 
large number of predictors. 

The degrees of a 2D and 3D lattice are 8 and 26, respectively. Conse¬ 
quently, the range of hyperparameters a and b that avoids phase transition 
is much tighter in the latter than the former case. Indeed, in the real appli¬ 
cation, when we assume the Ising prior on a 3D lattice, the results are much 
more sensitive to the choice of a and b. In general, we find a larger degree 
of the underlying graph corresponds to substantially more difficult hyper¬ 
parameter selection and inference, consistent with the observation made in 
Li and Zhang (2010). Also, it is crucial to examine 7 ^ 67 . Nevertheless, 
when choosing the underlying graph, the concern of the degree of the graph 
should not outweigh the true physical structure. For example, in fMRI data, 
we prefer an Ising prior on a 3D lattice than on a 2D lattice, as the latter 
only accounts for the structure in one slice and ignores the true 3D structure 
between voxels. 

In Bayesian variable selection problems, choice of hyperparameters af¬ 
fects not only posterior selection probabilities, but also computational time, 
convergence rate and required iteration of MCMC simulations. We found 
that if very few predictors are selected in each iteration, the DP prior tends 
to shrink the /3’s of all predictors into one identical value, leading to very 
sticky MCMC, which offsets the computational advantage per iteration of¬ 
fered by the shrinkage effect of the DP prior. Therefore, besides avoiding 
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the two extreme ends of full selection and zero selection, the trade-off be¬ 
tween computation per iteration and convergence rate should be taken into 
consideration when choosing the hyperparameters. 

4. Posterior computation. We use a Gibbs sampler with data augmen¬ 
tation to carry out the posterior inference of the proposed model: 7 | —, /3| —, 
(t| —, where ” denotes all the rest of the parameters. Below we describe 
the outline of the Gibbs sampler but relegate the computational details to 
Appendix B. 

The procedure to update the variance a, and the indicators 7 , which 
we update one at a time in a random order in each sweep, is standard. 
To draw posterior samples of /3, we use an approximate blocked Gibbs 
sampler based on the truncated stick-breaking process [Ishwaran and Zare- 
pour (2000); Ishwaran and James (2001)]. First choose a conservative upper 
bound, H < 00 on the number of mixture components potentially occupied 
by /3j’s in the sample. Then introduce latent class indicators for each predic¬ 
tor, Zj{^ {!,...,JJ}) with a multinomial distribution, Zj ~MN(w) where 
w = {wi,... ,wh}- This associates each predictor in the current iteration 
with a cluster h in the DP. In the Gibbs sampler, we first augment the 
cluster membership Zj and then sample /3j conditional on Zj. 

The main computational gain, especially when p is large, is due to the 
clustering nature of DP: because all the predictors in one cluster share the 
same coefficient, we only need to update one (3 for each cluster within each 
iteration. It is easy to show the computational order of the posterior com¬ 
putation of one MGMC iteration under the DP prior for /3 is 0{n xpx Psei), 
where psei is the number of selected predictors (model size) in that iteration. 
For comparison, we present the corresponding computational order under 
the standard spike-and-slab prior with Gaussian prior for (5, for which there 
are two general schemes for posterior computation: (i) sample all parame¬ 
ters, /3, a and 7; and (ii) integrate out j3 and a under the conjugate setup 
and only sample 7. In both schemes, the main computational burden is due 
to the inversion of the covariance matrix, which, even using fast low-rank 
update algorithms, is of the order 0{n x p^) and 0{n xpx Pggj), respectively. 
When p is very large as in this application, the computational order of the 
first scheme is prohibitive, and this is the reason that the vast majority of 
the SSVS literature in high-dimensional settings adopts the second scheme, 
which, however, does not provide posterior samples of the coefficients (5 or 
the variance a. Moreover, because of the squared term of PseU even when 
the average model size is modest (e.g., between 50-100), the second scheme 
can still incur overwhelming computational cost. In contrast, as shown in 
the details of the Gibbs sampler in Appendices A and B, the DP prior does 
not require matrix inversion, yet still provides posterior samples of /3’s with 
much lower computational cost. 
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5. Simulations. 

5.1. Simulation design. We conduct simulations to examine the perfor¬ 
mance of the Ising-DP prior and compare with several alternative meth¬ 
ods. We simulate data of n = 104 subjects (the number of subjects in the 
real application), each having p = 1000 candidate predictors overlaying a 
10 X 10 X 10 3D grid. Each predictor j (1 < j < 1000) is spatially indexed by 
dj = for 1 < d],dj < 10. To mimic the real data, we let predic¬ 

tors be strongly correlated, and the design matrices of the ith subject Xj = 
(Wi,... ,Xip) in all the following simulations follow a multivariate normal 

MVNp(/i,S), where = (/ii,...,/ip) Unif(3, 6 ) and 
where |dj^ — djjl = — dl^j. We consider the following four simula¬ 

tion scenarios. 

Scenario 1: One cluster of true predictors, with identical /3’s. There is 
a cluster of 5 x 5 x 5 (125) true predictors {jj = 1) with spatial indices 
4 < dj, d|, d| < 8 located in the center of the 3D cube. The coefficients /3 
of the true predictors are set to 0.6. The response is generated from Yi = 
J2j with Ei ~ N(0,200^) for z = 1,... ,re, creating a data set 

with a low SNR 5%—defined as V(X/3)/V(£). The following scenarios also 
all have such a low SNR, which is the norm in real fMRI data. 

Scenario 2 : One cluster of true predictors, with varying but strongly cor¬ 
related fi’s. We let the coefficients of the true predictors, locating on the 
same grid as those in scenario 1, vary and follow MVNp(0.6 x lp,D), where 
Djijj = 0.1 X 0 . 95 l‘^-’'i“'^-’ 2 l. Therefore, both the observed values and the un¬ 
derlying coefficients of neighboring predictors are strongly correlated. 

Scenario 3: Two clusters of true predictors, with identical fTs within each 
cluster. A more challenging scenario is when there are multiple spatially 
separated clusters of true predictors. Specihcally, we let the true predictors 
form two clusters: one overlays the grid of 3 < dj < 4,3 < d^ < 4,3 < d^ < 4, 
and another overlays the grid of 6 < dj < 9 ,6 < d^ < 9 ,6 < d^ < 9. We set the 
coefficients /3 of the predictors in the two clusters to 0.4 and 1, respectively. 

Scenario 4: Two clusters of true predictors, with varying fTs within each 
cluster. The true predictors locate on the same grid as those in scenario 3, 
and one cluster of (3 were generated from MVNp(0.4 x Ip, Di) with = 

0.1 X 0.95l'^-’i 1J and those in the second cluster are from MVNp(l x Ip, D 2 ) 

with = 0.1 X 0.95l'^-’i 1. Variable selection under two-cluster scenar¬ 

ios is challenging: the strong correlation between the predictors outside and 
inside the clusters renders differentiating nonsignihcant predictors, especially 
those located between the two clusters, from the true ones difficult. 

For each of the simulated data set, we ht the regression model (1) with 
four different priors: (i) i.i.d. Bernoulli prior for 7 ^, with a Gaussian prior 
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for the (this is the standard spike-and-slab prior), referred to as the 
i.i.d.-Gaussian prior; (ii) Ising prior for 7 ^, with a Gaussian prior for the 
/3j’s, referred to as the Ising-Gaussian prior; (iii) i.i.d. Bernoulli prior for 
7 j, with a DP prior for referred to as the i.i.d.-DP prior; (iv) the 

Ising-DP prior. The hyperparameters (a, 5) for the Ising priors are chosen 
by the proposed approach in Section 3, with a = —5 and b = 0.25. For the 
DP priors, we set H = 20, a = 1 and u = 10 such that Go is very flat in a 
wide domain. For each simulated data, we run 10 parallel Gibbs samplers 
with random start in 7 , each having 20,000 iterations with the first 10,000 
ones as burn-in. Posterior computation with the i.i.d.-Gaussian and Ising- 
Gaussian priors are carried out using the software by Li and Zhang (2010). 
The main summary statistic, the posterior inclusion probability, is deemed 
convergent upon inspecting the Gelman-Rubin statistic [Gelman and Rubin 
(1992)]. In all of our experiments, the 10 simulations lead to highly similar 
posterior summary statistics. 

5.2. Simulation results. We calculate the posterior inclusion probabili¬ 
ties Pr( 7 j = 1|Y) as the posterior summary statistics, obtained by dividing 
the number of iterations where 'jj = 1 over the total number of iterations 
excluding the burn-in period. To summarize these marginal probabilities, we 
compute the ROG curve as follows: only those covariates j with Pr( 7 j = 1|Y) 
greater than a threshold are deemed positives, and those below the threshold 
are deemed negatives; the ROG curve reflects the pair of true positive rate 
and false positive rate achieved by varying the calling threshold. The big¬ 
ger area under the ROG curve (maximum 1 ), the better the discriminating 
power of the model. 

The ROG curves resulting from the simulations under scenarios 1-2 (one 
cluster) and 3-4 (two clusters) are presented in the top and bottom panel 
of Figure 1, respectively. We also calculated the root mean squared error 
(RMSE) per variable, — (3j)'^, of each prior, summarized in Ta¬ 

ble 1. In all four simulations, the Ising-DP prior resulted in the best ROC, 
closely followed by the i.i.d.-DP prior, beating both the i.i.d.-Gaussian and 
the Ising-Gaussian priors. This pattern is consistent with the RMSEs. Over¬ 
all, the ROC curves suggest relatively low discriminating power in these 
simulations, even for the best-performing Ising-DP prior. This is not sur¬ 
prising because variable selection under all four scenarios is very challenging 
due to the low SNR, strong correlation between variables and the small-n 
large-p nature. Indeed, our experience based on more simulations suggests 
that as the SNR and/or the sample size decreases, performance of all the 
priors drops, but the Ising-DP prior is the least affected, demonstrating the 
benefit of introducing additional shrinkage to the coefficients when the signal 
is weak. In summary, it is evident from these simulations that the Ising-DP 
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One cluster 




(a) Identical /3 of true predictors 


(b) Varying /? of true predictors 


Two clusters 




(c) Identical /3 of true predictors within a cluster (d) Varying P of true predictors 


Fig. 1. ROC curves based on the posterior selection probability Pr(7j = 1|Y) obtained 
from i.i.d.-Gaussian, Ising-Gaussian, i.i.d.-DP and Ising-DP prior, respectively, under 
four simulation scenarios. 


prior outperforms the existing alternatives in data with characteristics sim¬ 
ilar to those of the fMRI data under study. 

It is worth noting that in these simulations the DP component appears 
to impose a stronger clustering effect on performance than the Ising com¬ 
ponent. One reason is that, as shown in Section 3, when the degree of the 
graph is large as in the 3D fMRI analysis, the hyperparameter b in the Ising 
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Table 1 

Root mean squared error (RMSE) per variable, — Pi)'^, by different priors 


Scenario 

I.i.d.-Gaussian 

Ising-Gaussian 

I.i.d.-DP 

Ising-DP 

1. One-cluster identical /3 

0.623 

0.599 

0.190 

0.190 

2. One-cluster varying /3 

0.284 

0.283 

0.181 

0.179 

3. Two-cluster identical /3 

0.311 

0.315 

0.256 

0.250 

4. Two-cluster varying /3 

0.368 

0.251 

0.235 

0.233 


prior used to control the clustering effect has to be set small to avoid phase 
transition, which consequently limits its clustering effect. Nevertheless, the 
simulation results suggest that incorporating the spatial information into 
Bayesian variable selection via the Ising prior still leads to improved selec¬ 
tion accuracy than otherwise. 

6. Application to the KLIFF study. 

6.1. The data. We now provide more information on the design of the 
KLIFF study and the preprocessing procedure. For each of the 104 pairs of 
participants in a close relationship (referred to as partners hereafter), one 
of them was randomly selected to be threatened by electric shocks while 
their brain activities were measured by fMRI in three separate sessions: in 
one session he/she is holding hands with his/her partner; in the second ses¬ 
sion, he/she is holding hands with a stranger; in the third session, he/she 
is alone, holding hands with nobody at all. The three hand-holding condi¬ 
tions mimic three types of social interactions. Each of the three sessions, 
randomized within each pair of partners, contains 24 trials in random order, 
half of which are threat cues (a red “X” on a black background) indicating 
a 20% likelihood of receiving an electric shock to the ankle, and the other 
half are safety cues (a blue “O” against a black background) indicating no 
chance of shock. A 3D fMRI scan of the subject’s brain was acquired for 
every 2 seconds in the experiment lasting for 400 s. Overall, fMRI data col¬ 
lected from the KLIFF experiment consist of 104 subjects in 3 sessions at 
200 time points for over 100,000 spatially distributed voxels. At the end of 
each session, the subjects facing the threat were asked to score their arousal 
and valence feelings experienced during the experiment. Both the arousal 
and valence measurements range from 1 to 9, encoding feelings from calm¬ 
ing/soothing to alert/agitated, and feelings from highly negative/miserable 
to highly positive/pleased, respectively. 

Preprocessing of the fMRI data was carried out via FMRIB’s Software 
Library (FSL) software [Version 5.98; Smith et al. (2004)]. Registration of 
the images in FLIRT [Jenkinson et al. (2002)] was based on Montreal Neu¬ 
rological Institute (MNI) space. More details of preprocessing can be found 
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in Zhang et al. (2013). ROIs were determined structurally using the Har¬ 
vard subcortical brain atlas, and were chosen for their likely involvement 
in affective processing based on previous studies [Maresh, Beckes and Coan 
(2013)]. In particular, our analysis focuses on two emotion related regions: 
dorsal anterior cingulate cortex (dACC) and insula, which were commonly 
implicated in negative affect and threat responding, and whose numbers of 
voxels are similar, 6666 and 6591, respectively. To obtain the predictors, we 
conducted massive univariate analysis using the GLM to get scalar sum¬ 
maries of the fMRI time series. Specifically, for every voxel in each ROI, we 
used the semi-parametric GLM approach in Zhang et al. (2013) to estimate 
the hemodynamic response functions (HRF) corresponding to the threat 
and safety cues (stimuli), and extracted the height of the HRF estimates, 
interpreted as the magnitude of brain response to the stimuli of that voxel. 
We then computed the difference between the estimated magnitudes under 
the threat cue and the safety cue (baseline) for each voxel as the predictors. 
In total, for each ROI, we obtained six sets of regression data: two different 
response variables—valence and arousal scores of the subjects, under each 
of the three hand-holding conditions, and associated magnitude estimates 
of each voxel in the ROI collected in the same session as the predictors. 

6.2. Results. We applied the proposed Bayesian model to the 12 sets 
of data (6 for each ROI) using the Ising-DP prior on a 3D lattice with 
hyperparameter a = —4.5 and b = 0.2 obtained from the method in Sec¬ 
tion 3. For comparison, we also fit the model with the i.i.d.-Gaussian and 
the Ising-Gaussian priors. For each regression, 25,000 iterations of MGMG 
were performed with the first 5000 discarded as burn-in. Gonvergence of the 
marginal inclusion probabilities is deemed via the Gehnan-Rubin statistics. 

Though the number of selected predictors is larger than the sample size in 
each MGMG iteration, the clustering effect of the DP prior leads to a small 
number of different (3 values (less than 10) in most iterations. Among the 12 
sets of regressions, we focused on those with (i) reasonably high R-squared 
values and (ii) top 10 % selected voxels having a high proportion of nonzero 
coefficients with the same sign. The R-squared value for each iteration t is 
given by 

Rj = l- Var(Y - Xt* • /3t)/Var(Y), 

where 7 ^ and are the posterior draws of 7 and /3, respectively, at the tth 
iteration. The first criterion requires that a significant proportion of variation 
of subjects’ emotion measurements can be explained by their brain response 
magnitudes, and the second requires that the majority of the top selected 
predictors have similar and significant effects on the response, matching the 
substantive knowledge from the existing psychology literature. We found 
three sets of regressions fit these two criteria: the regression with the arousal 


SPATIAL BAYESIAN VARIABLE SELECTION AND GROUPING 


17 



0.0 OJ 02 OJ 04 0.5 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.0 0.1 0.2 0.3 0.4 0.5 0.6 

(a) dAGC alone arousal (b) Insula alone arousal (c) Insula partner valence 


Pig. 2. R-squared values of the regressions. 


measurement under alone condition as the response in dACC and insula, 
respectively, and the regression with the valence measurement under hand- 
holding-with-partner condition as the response in insula. 

Histograms of the R-squared values and the coefficients of the top 10% 
selected voxels in these three regressions are displayed in Figures 2 and 3, 
respectively. We can see that in the regression with arousal under the alone 
condition as the response in dACC, the R-squared value is larger than 20% in 
more than 20% of the MCMC draws [Figure 2(a)], and almost all (>99.5%) 
of the top 10% selected voxels’ coefficients are positive in more than 90% 
of the posterior draws [Figure 3(a)]. The same regression in insula led to 
similar results [R-squared in Figure 2(b) and coefficients in Figure 3(b)]. 
The significant positive association between the arousal measurement and 
brain response magnitudes under the alone condition is consistent with re¬ 
lated hndings in the literature. First, in a previous study of the KLIFF data 
[Zhang et al. (2013)], we found that the brain response to threat stimulus 
is most active when subjects are alone. This phenomenon can be explained 



-1-1-1- I-1-1-1 I-1 -1-1-1-1— 

2 2.5 3 -0.5 0 0.5 1 1.5 2 -2.55 -2.5 -2.45 -2.4 

(a) 10% percentile (b) 10% percentile (c) 90% percentile 


dAGC arousal insula arousal insula valence 

Pig. 3. Histograms of 10% or 90% percentile of the coefficients (in scale 10~^) of the 
top 10% selected voxels in dACC and insula when regressing subjects’ arousal (the first 
two figures) or valence (the third figure) scores versus the magnitude of brain response to 
threat under the alone or hand-holding-with-partner condition. 
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through the social baseline theory [Beckes and Coan (2011); Coan, Beckes 
and Allen (2013); Coan and Maresh (2014)], which suggests that the human 
brain assumes proximity to other human beings, and perceives the envi¬ 
ronment as less threatening during the presence of other people in a close 
relationship, and thus serving as a default, or baseline, strategy of emotion 
regulation. This reduces the need to rely on effortful self-regulation in re¬ 
sponse to threat. On the other hand, when the subjects are alone without 
any social support, their brains have to use their own energy for emotion 
regulation, and, consequently, their emotional response is strong, and its 
association with subjects’ emotion measurements is easier to detect in the 
two emotion-related ROIs. Second, the positive association between brain 
response and excitement level corresponds with literature showing a role 
for dACC and insula in both cognitively- and physically-induced arousal 
[Critchley et al. (2000); Lewis et al. (2007)]. Since the use of electric shock 
as a threat stimulus causes physical pain and induce subjects’ internal aware¬ 
ness of upcoming pain during anticipation of a shock particularly, it is nat¬ 
ural that the more active emotion-related ROIs process the stimulus, the 
more intense and agitated feeling the subjects experience. 

We also found significant association between valence and brain response 
magnitude in insula under hand-holding-with-partner condition [R-squared 
values shown in Figure 2(c) and coefficients shown in Figure 3(c)]. The neg¬ 
ative association has two possible explanations. First, the threat stimulus 
induces subjects’ negative feelings, and the valence and arousal measures 
are negatively correlated, therefore, the more active the brain responds to 
the stimulus, the less pleased the subjects’ feelings. Second, according to the 
social baseline theory, humans feel less threatened under the hand-holding- 
with-partner condition. Thus, subjects’ emotion variation is more likely to 
occur in the valence dimension. We indeed found that the variance of sub¬ 
jects’ valence is larger than that of arousal. Moreover, insula is thought to 
mediate the awareness of internal bodily and emotional states [Craig (2009)] 
and is related to pain anticipation and intensity [Wiech, Ploner and Tracey 
(2008)]. Results of the regression under the hand-holding-with-stranger con¬ 
dition are not as stable as the other two regressions, possibly due to the 
individual differences in cognitive and affective perception of strangers. 

In all three regressions, the largest posterior selection probabilities of vox¬ 
els are around 0.1, and the majority of the probabilities are below 0.05. This 
is as expected given the very low SNR common in fMRI data. In these sit¬ 
uations, arguably, the ranks rather than absolute value of the probabilities 
are more informative about the selection results. Figures 4, 5 and 6 show the 
heatmaps of the posterior selection probabilities of the voxels in three slices 
based on their rank under the Ising-DP (top panel) in these regressions, 
respectively, in comparison to the corresponding heatmaps under the i.i.d.- 
Gaussian (middle panel) and the Ising-Gaussian prior (bottom panel). The 
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Ising-DP 



(a) Z=46 


(b) Z=47 
Ising-Gaussian 


(c) Z=48 



(d) Z=46 


(e) Z=47 (f) Z=48 

iid-Gaussian 



(g) Z=46 


(h) Z=47 


(i) Z=48 
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I I 

0.6 0.8 1.0 

Pig. 4. Heatmaps of voxels according to the ranks of their posterior inclusion probabilities 
obtained from Ising-DP, Ising-Gaussian and i.i.d.-Gaussian priors, respectively, in the 
Bayesian regression of subjects ’ arousal scores versus the magnitude of brain response to 
threat of voxels in dAGC and insula when subjects are alone. 



color scale is arbitrary, with dark red representing the selection probability 
in the lowest rank and light yellow representing the highest rank. The most 
striking pattern from these graphs is that the areas with the highest selection 
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Ising-DP 



(a) Z=23 


(b) Z=24 (c) Z=25 

Ising-Gaussian 



(d) Z=23 


(e) Z=24 
iid-Gaussian 


(f) Z=25 



(g) Z=23 


(h) Z=24 


(i) Z=25 
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Fig. 5. Heatmaps of voxels according to the ranks of their posterior inclusion probabilities 
obtained from Ising-DP, Ising-Gaussian and i.i.d.-Gaussian priors, respectively, in the 
Bayesian regression of subjects ’ arousal scores versus the magnitude of brain response to 
threat of voxels in insula when subjects are alone. 

probabilities identified by the Ising-DP prior were smoothly located across 
the ROIs, matching the scientific understanding of human brain functions, 
in contrast to those by the i.i.d.-Gaussian or the Ising-Gaussian prior, which 
are very diffused and scattered across the entire region. 
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Ising-DP 



(a) Z=23 (b) Z=24 (c) Z=25 

Ising-Gaussian 



(d) Z=23 (e) Z=24 (f) Z=25 

iid-Gaussian 



(g) Z=23 (h) Z=24 (i) Z=25 


Rank/p 

0.6 0.8 1.0 

Pig. 6. Heatmaps of voxels according to the ranks of their posterior inclusion probabilities 
obtained from Ising-DP, Ising-Gaussian and i.i.d.-Gaussian priors, respectively, in the 
Bayesian regression of subjects ’ valence scores versus the magnitude of brain response to 
threat of voxels in insula when subjects are hand holding with their partners. 

Since the underlying truth is unknown, we use a simulation-based pro¬ 
cedure to obtain the sampling distribution of the R-squared values of a 
null model. Specifically, we simulated, independently of the covariates, a 
normally distributed response variable with similar variance and range as 
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(a) Histogram of R-squared (b) Heatmap of selection (c) Voxels with highest top 10% 

probabilities of voxels selection probabilities 

Fig. 7. Regression of simulated response versus brain activity measurements in dACC 
under alone condition. 

the observed emotion measurements, and applied the Bayesian model to 
regress the simulated outcome on the observed covariates in dACC under 
the alone condition. The histogram of positive R-squared values in the pos¬ 
terior draws of this null model, shown in Figure 7(a), centers around zero, 
and is distinct from the histograms from the aforementioned three regres¬ 
sions, each of which has a much higher proportion of large R-squared values. 
In contrast, the histogram of the null model is very similar to those from the 
remaining nine regressions. As such, we deem there is no statistically sig¬ 
nificant association between the covariates and the responses in these nine 
regressions. 

7. Discussion. Motivated by the KLIFF hand-holding experiment, in 
this article we propose a joint Ising-DP prior within the Bayesian SSVS 
framework to achieve selection and grouping of spatially correlated vari¬ 
ables in high-dimensional SI regression models. We developed an analytic 
approach for deriving the bounds of the hyperparameters to avoid phase 
transition, a main challenge in methods involving the Ising prior. Though 
the bounds provided by our method are tighter than the previous mean field 
bounds, they are still only ballpark estimates and may be wide in graphs 
with high degrees. A focus of our future research is therefore to improve the 
method of hyperparameter selection for a more complex graphical structure. 

A major challenge to MCMC-based Bayesian methods in high-dimensional 
settings is computation. Though the DP prior in our model partially reduces 
the computational load by clustering the coefficients, computational scalabil¬ 
ity remains a challenge given the large p. Indeed, currently we are not able to 
perform a whole brain analysis with 100,000. Moreover, the mixing rate 
of the MCMC of the standard strategy in SSVS of updating one variable at 
a time may be slow, especially when the DP prior is involved. An attractive 
direction is to design a block update Gibbs sampling scheme that updates 
multiple variables at a time, and to parallelize the computation within a 
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block using graphics processing unit (GPU)-based programming [Suchard 
et al. (2010); Ge et al. (2014)]. The procedure can be further speeded up by 
carefully selecting the block so that it matches the underlying block struc¬ 
ture. 

The Ising prior is a special case of Markov random fields. Kalus, Samann 
and Fahrmeir (2014) proposed latent GMRFs via a probit model. The probit- 
GMRF prior simplifies the calculation of the hyperparameters and does not 
suffer from the phase transition behavior. However, the main computational 
hurdle of inversion of a matrix of the size of selected variables remains. 
Nevertheless, it is possible to combine the DP prior with the probit-GMRF 
prior to reduce the computation. 

Extension to binary and categorical responses is, in principle, straightfor¬ 
ward using generalized linear models. Computation is an increased focus, as 
closed-form posterior conditional distributions are no longer available. The 
same problem applies with censored survival models. Laplace approxima¬ 
tions [Raftery (1996)] are useful, but they usually require gradient methods 
for iterative computation of posterior modes for each sweep of covariates. 
A possible improvement can be obtained by exploiting the majorization- 
minimization/maximization (MM) algorithm [Lange (2008)]— a generalized 
version of the EM algorithm—for within-model mode computations. 

The proposed Ising-DP prior inherently assumes sparsity, that is, only 
a small portion of the voxels in the ROIs are associated with the indi¬ 
vidual scalar outcome. This is achieved via a point mass (spike-and-slab) 
prior for the regression coefficients, resulting in a “hard-thresholding” of 
the /3’s. However, in our real application, posterior probabilities of inclu¬ 
sion of nearly all voxels are relatively small, which suggests that an alter¬ 
native “soft-thresholding” without sparsity—achieved by (spatial adaption 
of) LASSO-type priors [Park and Casella (2008)] —may be desirable and a 
worthwhile direction for future investigation. 

Though we have focused on fMRI, the proposed model is applicable to 
other imaging modalities where detailed spatial information between covari¬ 
ates is available, such as DTI or MRI. 

Matlab code that implements the method is available at http://faculty. 
virginia.edu / tingtingzhang/Software.html . 

APPENDIX A: CALGULATION OF a '7 -L yB 7 

1. Two-dimensional square. For V‘^{V > 1) voxels on a square, the {V — 
2)^ voxels in the center all have 8 neighbors, the 4 vertex voxels have 3 
neighbors, and the 4 • (H — 2) voxels on the edge but not vertexes have 5 
neighbors. Then, given a and B as defined in Section 2, we have 

a '7 + 7 'B 7 = o • + 6 • (8 • (H - 2)2 + 4 • 3 + 5 • 4 • (H - 2)) 

= (a-L 86)^2 _ l2bV + 4b. 
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2. Three-dimensional cube. For {V > 1) voxels in a cube, the {V — 2)^ 
voxels in the center all have 26 neighbors, the 8 voxels on the vertex have 
7 neighbors, the \2{V — 2) voxels on the edge but not vertexes have 11 
neighbors, and the 6(17 — 2 )^ voxels on the 6 outside faces of the cube but 
not on the edges have 17 voxels. Then, given a and B as defined in Section 2, 
we have 

a '7 + 7'B7 

= a-V^ + b- {26{V - 2)3 + 8 • 7 + 12{V - 2) • 11 + 6{V - 2f ■ 17) 

= (a + 2Qb){V - 2)3 + 6 (a + nb){V - 2f + I2{a + llb){V - 2) 

+ 8 a + 566. 

APPENDIX B: POSTERIOR DISTRIBUTIONS IN THE GIBBS 

SAMPLER 

1. Update 7 . We update the indicator for one voxel 7 ^ at a time. Let 

7 (-j) = {tu ^ I(- 3 ) indices {7 = 1 H / j}, = {^i: I ^ 

j}, and X (_7 be the design matrix corresponding to /3(-j)- The prior prob¬ 
ability of jj = 1, Pr( 7 j = l| 7 (-i)) is exp(a -L bY,i^i^_.^7 i)/{^ + exp(a -L 

■) 7*))- By Bayes rule, the posterior probability of jj = 1 given 
the data and other parameters is 

Pr( 7 j = l|7(_j),/3,a,Y) 

^_ Pr(7i = l|7(-j)) _ 

Pr(7j = l|7(-j)) + E(j|7(_j))-i • Pr(7j = 0|7(_7) ’ 

where /3-7 denotes the dot product between and 7 , and E(j| 7 (_j)) is the 
Bayes factor, 

Pr(Y|7, = 1.7(_,).ft^) 

( 3 ) Pr(Y| 7 j =0,7 (_j),/3,cj) 

exp{- Er=i(^« - • 7 (-j))72c^^} ’ 

where Xj is the ith row of matrix 

2 . Update cr^. it^|— ~ Inv-Gamma(n/2,/ig-), where — Xj/3 • 

7)72. 

3. Update /3. Denote the /3j’s in Zj = hhy /3^, and let X 6 = Zj=h^U- 

Note that X 6 = 0 if {j : jj = 1, Zj = h} = 0. Also, let = {fij : Zj / h}, 
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7^“^) = {tj : Zj / h} and = {Xj : Zj / h}, respectively, denote the col¬ 

lection of all the /3’s and the design matrix of the covariates not in cluster 
h. Then for h = 1,..., H, 


with + and 

S^. This part can be parallelized (across h). 

The posterior cluster membership Z is drawn from a multinomial distri¬ 
bution with 


Fr{Zj = h\-fj = 1,-) 


Wfeexpj- J27=i(Xi - ^il^(jh) • 70fe))V2o-^} 
EfcLi ^nfcexp{- - ^iP{jk) • 7ofc))V2T2} ’ 


Pr(Zj =h\'yj = 0,-) = Wh, 

where = {/3i, ■ ■ ■, Pj-i, Pj+i, ■ ■ ■, Pp) a.nd'Y(jh) = ihi, ■ ■ ■ ,'lj-iA,hj+i, 

..., 7 p) for h = 1,..., H and j = 1,... ,p. To update the associated weights 
•w, first set w'pp = 1 and draw w'f^ from Beta(l -|- Ylj- z =h Ij o: -|- 
for each /i G {1,..., i/ — 1}, then update Wh = w'j^ nA:</i(^ “ 
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SUPPLEMENTARY MATERIAL 

Heatmaps (DOL 10.1214/15-AOAS818SUPP; .pdf). We provide the heat- 
maps of the voxels with top 10% highest posterior selection probabilities 
obtained, resulting from Ising-DP, Ising-Gaussian and i.i.d.-Gaussian priors, 
respectively, in three regressions [Li et al. (2015)]. 


REFERENGES 

Allen, J. P., Porter, M., McEarland, F. C., McElhaney, K. B. and Marsh, P. 
(2007). The relation of attachment security to adolescents’ paternal and peer relation¬ 
ships, depression, and externalizing behavior. Child Development 78 1222-1239. 

Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian 
nonparametric problems. Ann. Statist. 2 1152-1174. MR0365969 

Beckes, L. and Goan, J. A. (2011). Social baseline theory: The role of social proximity in 
emotion and economy of action. Social and Personality Psychology Compass 5 976-988. 



26 


F. LI ET AL. 


Bowman, F. D. (2007). Spatiotemporal models for region of interest analyses of functional 
neuroimaging data. J. Amer. Statist. Assoc. 102 442-453. MR2370845 
Bowman, F. D., Caffo, B., Bassett, S. S. and Kilts, C. (2008). A Bayesian hierar¬ 
chical framework for spatial modeling of fMRI data. Neuroimage 39 146-156. 
Bradley, M. M. and Lang, P. J. (1994). Measuring emotion: The self-assessment 
mankin and the semantic differential. J. Behav. Ther. Exp. Psychiatry 25 49-59. 
COAN, J. A. (2010). Adult attachment and the brain. J. Soc. Pers. Relatsh. 27 210-217. 
COAN, J. A. (2011). The social regulation of emotion. In Oxford Handbook of Social 
Neuroscience 614-623. Oxford Univ. Press, New York. 

CoAN, J. A., Beckes, L. and Allen, J. P. (2013). Childhood maternal support and 
social capital moderate the regulatory impact of social relationships in adulthood. Int. 
J. Psychophysiol. 88 224-231. 

COAN, J. A. and Maresh, E. L. (2014). Social baseline theory and the social regulation 
of emotion. In The Handbook of Emotion Regulation, 2nd ed. (J. GROSS, ed.) 221-236. 
The Guilford Press, New York. 

CoAN, J. A., Schaefer, H. S. and Davidson, R. J. (2006). Lending a hand: Social 
regulation of the neural response to threat. Psychol. Sci. 17 1032-1039. 

Craig, A. D. (2009). How do you fell now? The anterior insula and human awareness. 
Nat. Rev. Neurosci. 10 59-70. 

Critchley, H. D., Corfield, D. R., Chandler, M. P., Mathias, C. J. and 
Dolan, R. J. (2000). Cerebral correlates of autonomic cardiovascular arousal: A func¬ 
tional neuroimaging investigation in humans. J. Physiol. (Land.) 523 259-270. 
Derado, G., Bowman, F. D. and Kilts, C. D. (2010). Modeling the spatial and temporal 
dependence in fMRI data. Biometrics 66 949-957. MR2758231 
Dunson, D. B., Herring, A. H. and Engel, S. M. (2008). Bayesian selection and 
clustering of polymorphisms in functionally related genes. J. Amer. Statist. Assoc. 103 
534-546. MR2523991 

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. 
Statist. 1 209-230. MR0350949 

Ferguson, T. S. (1974). Prior distributions on spaces of probability measures. Ann. 
Statist. 2 615-629. MR0438568 

Friston, K. j.. Holmes, A. P., Worsley, K., Poline, P. J., Frith, C. and Frack- 
OWIAK, R. (1995). Statistical parametric maps in functional imaging: A general linear 
approach. Hum. Brain Mapp. 2 189-210. 

Ge, T., Muller-Lenke, N., Bendfeldt, K., Nichols, T. E. and Johnson, T. D. 
(2014). Analysis of multiple sclerosis lesions via spatially varying coefficients. Ann. 
Appl. Stat. 8 1095-1118. MR3262547 

Gelman, a. E. and Rubin, D. B. (1992). Inference from iterative simulation using mul¬ 
tiple sequences. Statist. Sci. 7 457-472. 

George, E. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. J. 
Amer. Statist. Assoc. 88 881-889. 

George, E. and McCulloch, R. E. (1997). Approaches for Bayesian variable selection. 
Statist. Sinica 7 339-373. 

Goldsmith, J., Huang, L. and Crainiceanu, C. M. (2014). Smooth scalar-on-image 
regression via spatial Bayesian variable selection. J. Comput. Graph. Statist. 23 46-64. 
MR3173760 

Gossl, C., Auer, D. P. and Fahrmeir, L. (2001). Bayesian spatiotemporal inference in 
functional magnetic resonance imaging. Biometrics 57 554-562. MR1855691 


SPATIAL BAYESIAN VARIABLE SELECTION AND GROUPING 


27 


Huang, L., Goldsmith, J., Reiss, P. T., Reich, D. S. and Crainiceanu, C. M. (2013). 
Bayesian scalar-on-image regression with application to association between intracranial 
DTI and cognitive outcomes. Neuroimage 83 210-223. 

ISHWARAN, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. 
J. Amer. Statist. Assoc. 96 161-173. MR1952729 

ISHWARAN, H. and Zarepour, M. (2000). Markov chain Monte Carlo in approximate 
Dirichlet and beta two-parameter process hierarchical models. Biometrika 87 371-390. 
MR1782485 

Jenkinson, M., Bannister, P., Brady, M. and Smith, S. (2002). Improved optimization 
for the robust and accurate linear registration and motion correction of brain images. 
Neuroimage 17 825-841. 

Johnson, T. D., Liu, Z., Bartsch, A. J. and Nichols, T. E. (2013). A Bayesian non- 
parametric Potts model with application to pre-snrgical FMRI data. Stat. Methods 
Med. Res. 22 364-381. MR3190664 

Kalus, S., Samann, P. G. and Fahrmeir, L. (2014). Classification of brain activation 
via spatial Bayesian variable selection in fMRI regression. Adv. Data Anal. Classif. 8 
63-83. MR3168680 

Kang, J., Johnson, T. D., Nichols, T. E. and Wager, T. D. (2011). Meta analysis 
of functional neuroimaging data via Bayesian spatial point processes. J. Amer. Statist. 
Assoc. 106 124-134. MR2816707 

Kim, S., Tadesse, M. G. and Vannucci, M. (2006). Variable selection in clustering via 
Dirichlet process mixture models. Biometrika 93 877-893. MR2285077 

Lang, P. J., Greenwald, M. K., Bradley, M. M. and Hamm, A. O. (1993). Looking 
at pictures: Affective, facial, visceral, and behavioral reactions. Psychophysiology 30 
261-273. 

Lange, K. (2008). Optimization. Springer Texts in Statistics 95 . Springer, New York. 

Lewis, P. A., Critchley, H. D., Rotshtein, P. and Dolan, R. J. (2007). Neural 
correlates of processing valence and arousal in affective words. Cereb. Cortex 17 742- 
748. 

Li, F. and Zhang, N. R. (2010). Bayesian variable selection in structured high¬ 
dimensional covariate spaces with applications in genomics. J. Amer. Statist. Assoc. 
105 1202-1214. MR2752615 

Li, F., Zhang, T., Wang, Q., Gonzalez, M., Maresh, E. L. and Goan, J. A. (2015). 
Supplement to “Spatial Bayesian variable selection and grouping for high-dimensional 
scalar-on-image regression.” DOL10.1214/15-AOAS818SUPP. 

Maresh, E. L., Beckes, L. and Goan, J. A. (2013). The social regulation of threat- 
related attentional disengagement in highly anxious individuals. Front. Human Neu- 
rosci. 7 515. 

Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear 
regression. J. Amer. Statist. Assoc. 83 1023-1036. MR0997578 

Park, T. and Casella, G. (2008). The Bayesian lasso. J. Amer. Statist. Assoc. 103 
681-686. MR2524001 

Penny, W. D., Trujillo-Barreto, N. J. and Friston, K. J. (2005). Bayesian fMRI 
time series analysis with spatial priors. Neuroimage 24 350-362. 

Raftery, a. E. (1996). Approximate Bayes factors and accounting for model uncertainty 
in generalised linear models. Biometrika 83 251-266. MR1439782 

Reiss, P. T., Mennes, M., Petkova, E., Huang, L., Hoptman, M. J., Biswal, B. B., 
COLCOMBE, S. J., Zuo, X.-N. and Milham, M. P. (2011). Extracting information 
from functional connectivity maps via function-on-scalar regression. Neuroimage 56 
140-148. 


28 


F. LI ET AL. 


Reiss, P. T., Huo, L., Zhao, Y., Kelly, C. and Ogden, R. T. (2015). Wavelet-domain 
regression and predictive inference in psychiatric neuroimaging. Ann. Appl. Stat. 9 
1076-1101. 

Russell, J. (1980). A circumplex model of affect. J. Pers. Soc. Psychol. 39 1161-1178. 

Sethuraman, j. (1994). A constructive definition of Dirichlet priors. Statist. Sinica 4 
639-650. MR1309433 

Smith, M. and Fahrmeir, L. (2007). Spatial Bayesian variable selection with applica¬ 
tion to functional magnetic resonance imaging. J. Amer. Statist. Assoc. 102 417-431. 
MR2370843 

Smith, M. and Kohn, R. (1996). Nonparametric regression using Bayesian variable se¬ 
lection. J. Econometrics 75 317-343. 

Smith, M., Putz, B., Auer, D. and Fahrmeir, L. (2003). Assessing brain activity 
through spatial Bayesian variable selection. Neuroimage 20 802-815. 

Smith, S. M., Jenkinson, M., Woolrich, M. W., Beckmann, C. F., 
Behrens, T. E. J., Johansen-Berg, H., Bannister, P. R., De Luca, M., Drob- 
NJAK, L, Flitney, D. E., Niazy, R., Saunders, J., Vickers, J., Zhang, Y., De 
Stefano, N., Brady, J. M. and Matthews, P. M. (2004). In advances in functional 
and structural MR image analysis and implementation as FSL. Neuroimage 23(S1) 
208-219. 

Stanley, H. E. (1987). Introduction to Phase Transitions and Critical Phenomena. Ox¬ 
ford Univ. Press, New York. 

Stingo, F. C., Chen, Y. A., Tadesse, M. G. and Vannucci, M. (2011). Incorporat¬ 
ing biological information into linear models: A Bayesian approach to the selection of 
pathways and genes. Ann. Appl. Stat. 5 1978-2002. MR2884929 

Suchard, M. a., Wang, Q., Chan, C., Frelinger, J., Cron, A. and West, M. (2010). 
Understanding GPU programming for statistical computation: Studies in massively 
parallel massive mixtures. J. Comput. Graph. Statist. 19 419-438. MR2758309 

Tadesse, M. G., Sha, N. and Vannucci, M. (2005). Bayesian variable selection in clus¬ 
tering high-dimensional data. J. Amer. Statist. Assoc. 100 602-617. MR2160563 

Vannucci, M. and Stingo, F. G. (2011). Bayesian models for variable selection that in¬ 
corporate biological information. In Bayesian Statistics 9 (J. Bernardo, M. Bayarri, 
J. Berger, A. Dawid, D. Heckerman, A. Smith and M. West, eds.) 659-678. Ox¬ 
ford Univ. Press, Oxford. MR3204022 

West, M. (2003). Bayesian factor regression models in the “large p, small n” paradigm. In 
Bayesian Statistics 1 (Tenerife, 2002) (J. M. Bernardo, J. O. Berger, A. P. Dawid, 
and A. F. M. Smith, eds.) 733-742. Oxford Univ. Press, New York. MR2003537 

WiECH, K., Ploner, M. and Tracey, I. (2008). Neurocognitive aspects of pain percep¬ 
tion. Trends Cogn. Sci. 12 306-313. 

Woolrich, M. W., Jenkinson, M., Brady, J. M. and Smith, S. M. (2004). Fully 
Bayesian spatio-temporal modeling of fMRI data. IEEE Trans. Med. Imag. 23 213- 
231. 

Yue, Y. R., Lindquist, M. A. and Loh, J. M. (2012). Meta-analysis of functional 
neuroimaging data using Bayesian nonparametric binary regression. Ann. Appl. Stat. 6 
697-718. MR2976488 

Zhang, T., Li, F., Beckes, L. and Goan, J. A. (2013). A semi-parametric model of the 
hemodynamic response for multi-subject fMRI data. Neuroimage 75 136-145. 


SPATIAL BAYESIAN VARIABLE SELECTION AND GROUPING 


29 


F. Li 

Q. Wang 

Department of Statistical Science 
Duke University 

Durham, North Carolina 27708-0251 

USA 

E-mail: fii@stat.duke.edu 

quanli@stat. duke.edu 

M. Z. Gonzalez 
E. L. Maresh 
J. A. Goan 
Department of Psychology 
University of Virginia 
Charlottesville, Virginia 22904 
USA 

E-mail: mzg7uv@virginia.edu 
elm2cg@virginia.edu 
jcoan@virginia.edu 


T. Zhang 

Department of Statistics 
University of Virginia 
Charlottesville, Virginia 22904 
USA 

E-mail: tz3b@virginia.edu 


